Okay, welcome back to deep learning. Happy New Year everybody. It's 2020. So we still
have quite a few exciting lectures coming up this year and today we want to start with
deep reinforcement learning. Today we finally look into how we can train artificial intelligence systems to play games automatically. So this is the idea of deep reinforcement
learning. In order to introduce that, we will first have a look at sequential decision making, because we are not facing a single decision here: we want to make a whole sequence of decisions. In the beginning these decisions will be independent, but of course, when you are playing a game, every decision has an effect on future decisions and also on the future reward. Then we will introduce reinforcement learning. This is the part where we really start looking into decisions that depend on each other and how to learn them. To do so, we will introduce the Markov decision process. Markov decision processes are essentially key to understanding the concepts of reinforcement learning. Then we talk about policy iteration and other solution methods, i.e., how to actually train those systems. And at the end of the lecture we want to talk about deep reinforcement learning. This is the deep learning version of reinforcement learning, and there we will talk a bit about deep Q-learning, AlphaGo, and AlphaGo Zero. So that's about
the outline of today's talk. So first we start off very basic, very simple with sequential
decision making. And you'll see we'll add on step by step until we can really formulate
an entire game as a learning process. If we do sequential decision making and want to start with a very simple game, we can start with the so-called multi-armed bandit problem. Here we always have the following setup: an action a_t is taken at time t from a set capital A. So there is a set of actions that you can take, and you choose one of them. This means there is only a limited number of moves that you can do at every time t, and for every action there is some kind of reward. To keep this simple, we can stay in the setting of the multi-armed bandit problem, where you essentially choose one of the slot machines, pull its lever, and then get some reward. Because we don't know what happens inside the slot machines, every action has a different, but to us unknown, probability density function that generates some reward r_t at time t. So the reward depends on which action we take, and this reward is what we seek to maximize: in this case, we want to make as much money as possible using those slot machines.
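To make this setup a bit more tangible, here is a minimal sketch of such a multi-armed bandit in Python. The Gaussian reward distributions, their parameters, and the class name `MultiArmedBandit` are my own illustrative assumptions; the problem itself only requires that each arm has some reward distribution that is unknown to the player.

```python
import numpy as np

class MultiArmedBandit:
    """k slot machines; each arm has its own reward distribution
    that is unknown to the player."""

    def __init__(self, k=5, seed=0):
        self.rng = np.random.default_rng(seed)
        # Hidden mean reward of every arm (the player never sees these).
        self.means = self.rng.normal(loc=0.0, scale=1.0, size=k)
        self.k = k

    def pull(self, action):
        """Take action a_t (pull arm `action`) and return a reward r_t
        drawn from that arm's distribution."""
        return self.rng.normal(loc=self.means[action], scale=1.0)

# Pull arm 2 once and observe a reward.
bandit = MultiArmedBandit(k=5)
reward = bandit.pull(2)
```

The set of actions here is simply {0, ..., k-1}, and pulling the same arm twice generally gives different rewards, because each reward is a random draw from that arm's distribution.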
In order to maximize the reward, we have to define a policy, and the policy is what helps us choose an action at a given time; the policy essentially tells us which action to take. We can also formulate this as a probability density function. So now we have actions, we have rewards, and the policy is a probability density function that tells us which of the actions we want to take.
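As a small sketch of this idea in code, reusing the `MultiArmedBandit` from above: the uniform policy below is just one arbitrary example of such a distribution over the actions, not a particularly good one.

```python
import numpy as np

def sample_action(policy_probs, rng=np.random.default_rng(42)):
    """A policy as a probability distribution over the action set:
    policy_probs[a] is the probability of choosing action a."""
    return rng.choice(len(policy_probs), p=policy_probs)

# A uniform random policy over 5 actions.
uniform_policy = np.full(5, 1.0 / 5)

action = sample_action(uniform_policy)   # the policy picks an action...
reward = bandit.pull(action)             # ...and the bandit returns a reward
```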
So now we want to win, right? We want to collect the maximum reward over time. Of course, we cannot tell the future for any action, but we can compute an expected value for that action. So we seek the action that maximizes the expected reward over time: we want to take the action that generates the maximum expected reward over time.
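Written down formally, with $\mathbb{E}[r_t \mid a_t = a]$ denoting the expected reward when action $a$ is taken (this is just one way to formalize the statement above), the action we would like to take is

$$a_t^{*} = \underset{a \in \mathcal{A}}{\arg\max}\; \mathbb{E}\left[r_t \mid a_t = a\right].$$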
Note that there is a difference to supervised learning: there is no immediate feedback on which action to choose. We just have this reward, and only after we have observed those rewards can we actually see whether an action was good or not. In supervised learning you would formulate the problem differently: you would have class labels, you would try to identify features, and then classify into one or several classes. Here we don't do that; here we directly get the reward, and the reward tells us whether it was a good action or not. That's quite interesting, because with these rewards we can model a lot of situations, and we will see in the next couple of minutes how to do that. One big problem, of course, is that this expected value, the expectation of the reward, is not known in advance, so we somehow have to estimate it. So this is one of the key problems, and we could form
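One common way to estimate this expectation, continuing the sketch from above, is to keep a running sample average of the rewards observed for each action. The function name `run_sample_average` and the fixed number of steps are my own choices; this is only one possible estimator.

```python
import numpy as np

def run_sample_average(bandit, policy_probs, steps=1000,
                       rng=np.random.default_rng(1)):
    """Estimate the expected reward of every action by averaging the
    rewards observed whenever that action was taken."""
    q_estimates = np.zeros(bandit.k)   # current estimate of E[r | a]
    counts = np.zeros(bandit.k)        # how often each action was taken

    for _ in range(steps):
        action = rng.choice(bandit.k, p=policy_probs)
        reward = bandit.pull(action)
        counts[action] += 1
        # Incremental update of the sample average:
        # Q_new = Q_old + (r - Q_old) / n
        q_estimates[action] += (reward - q_estimates[action]) / counts[action]

    return q_estimates

# With enough pulls, q_estimates approaches the hidden means of the arms;
# the arm with the largest estimate is the one we would like to play.
estimates = run_sample_average(bandit, uniform_policy)
best_action = int(np.argmax(estimates))
```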